Abstract

Resources in a distributed system can be identified using identifiers 
based on random numbers. When using a distributed hash table to resolve 
such identifiers to network locations, the straightforward approach is to 
store the network location directly in the hash table entry associated with 
an identifier. When a mobile host contains a large number of resources, 
this requires that all of the associated hash table entries must be updated 
when its network address changes. 

We propose an alternative approach where we store a host identifier 
in the entry associated with a resource identifier and the actual network 
address of the host in a separate host entry. This can drastically reduce 
the time required for updating the distributed hash table when a mobile 
host changes its network address. We also investigate under which
circumstances our approach should or should not be used. We evaluate and
confirm the usefulness of our approach with experiments run on top of 
OpenDHT. 

1 Introduction 

A distributed system needs a way to identify computing resources that it uses. 
A common way to identify them is to use a URL [?], which comprises a network
address for the host and a path within the host (see footnote 1). However, some
authors have argued that resource identifiers should contain little or no
information, since otherwise an identifier would become invalid whenever there
are changes in network locations, storage locations, naming policy,
organization, etc. [?]
Using random numbers for resource identification avoids including such
information in the identifier. It also makes it easy to allocate identifiers without
having to manage the identifier space carefully, since identifier duplication is 



1 In http://example.invalid/1/2/3/, for example, the network address for the host would
be example.invalid while /1/2/3/ would be the path within the host.
virtually impossible with a large enough identifier space (see footnote 2). However,
unique identifiers based on random numbers need to be resolved to actual network
locations, which are often composed of a host address and a path within the host,
in order for the represented resource to be used. One way to resolve such
identifiers is to use distributed hash tables [1] [17]. Most distributed hash tables are
scalable and are well suited for storing data indexed by random identifiers.

The most straightforward way to index a resource in a distributed hash table 
is to simply store an entry in the table with the resource identifier as the key and 
the network location of the resource as the value. However, this results in
performance and latency issues when a mobile host contains many resources, since
the network location in each entry must be updated independently whenever 
the host moves. 

For data, this problem can be alleviated by using replication [16]. However, 
there are many cases when replication is not a feasible option, some of which 
are listed below: 

• Owners of nodes in a distributed hash table may not be willing to contribute
large amounts of storage to store data for other people. For example,
while they may be willing to store network locations of home videos,
they may not be willing to store the video files themselves.

• An identifier needs to identify a specific master copy of a file in a mobile 
host in order to ensure that updates to the file are immediately available 
to the host. 

• The resource in question is inherently not replicable. For example, it could 
be a network service or sensor specific to a mobile host. 

Instead of replication, an alternative approach is to use Mobile IP [12] or
the Host Identity Protocol [?] to preserve the network address of mobile hosts. 
This requires support in the operating system and the network infrastructure. 
Such support is not widespread, however, so this approach may not be desirable 
for applications using distributed hash tables. 

In this paper, we propose the use of indirect entries in a distributed hash 
table. An indirect entry contains a host identifier and a host-specific path. A 
host identifier is a random number which identifies a specific mobile host and is 
the key to a host entry in the distributed hash table. The network address of 
the mobile host is obtained from the host entry, which gives the actual network 
location when combined with the host-specific path. When a mobile host moves, 
only its host entry needs to be updated. 

The remainder of the paper is organized as follows. We discuss related work 
in section 2. Section 3 describes our proposal and discusses the circumstances
under which it should or should not be used. We evaluate it against the
straightforward approach in section 4 and conclude in section 5.

2 Given a 160-bit identifier space, the probability that even a single duplicate identifier 
would be generated over a 100-year time period with a billion identifiers being generated per 
second is about 10^-12.
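
The estimate in footnote 2 can be checked with the standard birthday-bound approximation, p ≈ 1 - exp(-n^2 / (2 · 2^b)), for n identifiers drawn uniformly from a b-bit space. The following is a minimal sketch; the function name and the 365.25-day year are assumptions made for illustration and are not taken from the paper.

```python
import math


def collision_probability(num_ids: float, bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n^2 / (2 * 2^bits))."""
    exponent = -(num_ids ** 2) / (2.0 * 2.0 ** bits)
    # -expm1(x) equals 1 - exp(x) and stays accurate when x is close to zero.
    return -math.expm1(exponent)


# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# One billion identifiers per second, sustained for 100 years.
num_ids = 1e9 * SECONDS_PER_YEAR * 100

p = collision_probability(num_ids, 160)
print(f"collision probability: {p:.1e}")
```

The result is on the order of 10^-12, which agrees with the figure quoted in the footnote.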
2 Related work



Random numbers are commonly used as globally unique identifiers, which takes
advantage of the fact that the probability of two different resources being
assigned the same random number is extremely low for a large enough identifier
space. For example, X.667 defines random-number-based UUIDs [21], while
SPKI/SDSI uses hashes of public keys, which for identification purposes are
similar to random numbers [6].

Ballintijn et al. argue that resource naming should be decoupled from resource
identification [2]. Resources are named with human-friendly names,
which are based on DNS [10], while identification is done with object handles,
which are globally unique identifiers that need not contain network locations. 
They use DNS to resolve human-friendly names to object handles and a location 
service to resolve object handles to network locations. The location service uses 
a hierarchical architecture for resolving object handles. This two-level approach 
allows the naming of resources without worrying about replication or migration 
and the identification of resources without worrying about naming policies. 

Walfish et al. argue for the use of semantic-free references for identifying web 
documents instead of URLs [19] . The reason is that changes in naming policies 
or ownership of DNS domain names often result in previous URLs pointing to 
unrelated or non-existent documents, even when the original documents still 
exist. Semantic-free references are hashes of public keys or other data, and 
are resolved to URLs using a distributed hash table based on Chord [17]. Using
semantic-free references would allow web documents to link to each other
without worrying about changes in the URLs of the documents. 

Distributed hash tables, also called peer-to-peer structured overlay networks,
are distributed systems which map a uniform distribution of identifiers to nodes
in the system [1] [17] [22]. Nodes act as peers, with no node having to play a
special role, and a distributed hash table can continue operation even as nodes
join or leave the system. Lookups and updates to a distributed hash table are
scalable, typically taking time logarithmic in the number of nodes in the system.
We experimentally evaluated our work using OpenDHT [15], which is a public
distributed hash table service based on Bamboo [14].

There has also been research on implementing distributed hash tables on top 
of mobile ad hoc networks [8] [9] . As with Mobile IP [12] and HIP [?] , hosts in 
mobile ad hoc networks do not change their network address with movement, 
so there would be no need to update entries in a distributed hash table used for 
resolving resource identifiers. However, almost none of the Internet is part of
a mobile ad hoc network, so this is of little help to applications that need to run
on current networks.

3 Batch update for mobile hosts 

The most straightforward way to map a unique identifier to an actual network 
location is to store a direct entry in the distributed hash table for each identifier, 
Figure 1: Representation of when identifiers are mapped directly to resources in
a single mobile host. Circles denote identifiers, diamonds denote direct entries 
in the hash table, and squares denote network locations. All of the hash table 
entries must be updated whenever the mobile host moves. 

with the identifier as the key and the actual network location as the value.
Resolution is done by simply looking up the identifier in the distributed hash table
and using the resulting network location. Figure 1 illustrates this approach.

When the network address of a mobile host changes, this approach requires 
that entries for every resource contained by the host be updated independently. 
For a distributed hash table consisting of a constant number of nodes, this
requires time linear in the number of resources in the mobile host. When the
host contains a large number of resources, this can result in an unacceptably 
large delay before identifiers can be resolved to their updated location. 

Instead of storing the network location directly in a hash table entry for 
a resource identifier, we propose the alternative approach of storing both a
location-independent identifier for the mobile host and the host-specific path in
an indirect entry. The host identifier identifies the mobile host which contains
the resource and is a random number, as with ordinary resource identifiers. The
distributed hash table contains a host entry which maps this identifier to the 
network address of the mobile host. The path identifies the specific resource 
within the host. 

In this approach, we first find the corresponding indirect entry for the given 
resource identifier in the distributed hash table. Once the indirect entry is 
found, we find the corresponding host entry for the included host identifier. We 
then combine the network address of the host in the host entry and the path 
in the indirect entry to construct the network location of the desired resource (see footnote 3).
This requires two lookups to the distributed hash table, compared to a single 
lookup required for direct entries. 

However, updating the distributed hash table when a mobile host changes 
its network address is much more efficient when using indirect entries compared 

3 When a network location is given as an HTTP URL, the network address of the host would
be an IP address and the port number, while the path would simply be the URL path. 
Figure 2: Representation of when identifiers are mapped indirectly to resources
in a single mobile host. Circles denote identifiers, diamonds denote indirect 
entries in the hash table, the triangle denotes the host entry, and squares denote 
network locations. Only the host entry needs to be updated whenever the mobile 
host moves. 

to using direct entries. Unlike with direct entries, where every entry must be 
updated independently, only a single host entry needs to be updated when using 
indirect entries. This can greatly reduce the delay during which resource
identifiers cannot be resolved to their correct network locations. Figure 2 illustrates
the approach using indirect entries.
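
The difference in update work can be made concrete with a toy model. The sketch below is not the authors' implementation: `CountingTable` is a hypothetical in-memory stand-in for a distributed hash table that only counts put operations, and the identifiers and addresses are made-up examples.

```python
class CountingTable:
    """Toy stand-in for a distributed hash table that counts put operations."""

    def __init__(self):
        self.entries = {}
        self.puts = 0

    def put(self, key, value):
        self.entries[key] = value
        self.puts += 1


def migrate_direct(table, paths_by_id, new_address):
    """Direct entries: rewrite the network location stored under every resource identifier."""
    for resource_id, path in paths_by_id.items():
        table.put(resource_id, ("direct", new_address + path))


def migrate_indirect(table, host_id, new_address):
    """Indirect entries: only the single host entry changes."""
    table.put(host_id, ("host", new_address))


# Hypothetical host with 1000 resources moving to a new (example) address.
paths_by_id = {f"res-{i}": f"/files/{i}" for i in range(1000)}

direct_table = CountingTable()
migrate_direct(direct_table, paths_by_id, "198.51.100.7:8080")

indirect_table = CountingTable()
migrate_indirect(indirect_table, "host-42", "198.51.100.7:8080")

print(direct_table.puts, indirect_table.puts)  # 1000 puts versus 1 put
```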

3.1 Using direct and indirect entries together 

If a host contains only a very small number of resources or almost never changes 
its network address, then using direct entries would be more efficient because of 
the smaller lookup overhead. On the other hand, using indirect entries
drastically reduces the update latency for a mobile host which contains a large number
of resources and changes its network address frequently. Fortunately, both types 
of entries can be used simultaneously in a single distributed hash table. 

Entries in the distributed hash table can be prefixed with a magic number
that identifies the type of the entry. The magic numbers are used to
distinguish among direct, indirect, and host entries. They also serve to prevent
potential conflicts when the same distributed hash table is used for other
applications besides resource identifier resolution. Table 1 shows the entry types
and their contents, while figure 3 describes the resolution procedure.
Type              Content

Direct entry      M_D, network location
Indirect entry    M_I, host identifier, path
Host entry        M_H, host network address

Table 1: Entry types and their contents. M_D, M_I, and M_H are magic numbers for
direct, indirect, and host entries, respectively.



1. Find entry indexed by the resource identifier in the distributed
   hash table.

2. If entry is direct entry, return with included network location.

3. If entry is indirect entry,

   (a) Find host entry indexed by the included host identifier.

   (b) Combine network address of host in the host entry
       and the path of the resource in the indirect entry to
       construct the network location of the resource.

   (c) Return with the network location.

4. Otherwise, return that the resource cannot be found.



Figure 3: Resource identifier resolution procedure. This procedure does not 
treat a host as a resource (extending it so that it does is trivial). 
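
The procedure in figure 3 can be sketched in a few lines of Python. This is an illustrative sketch only: a plain dictionary stands in for the distributed hash table, entries are modeled as tuples whose first element is the magic number, and the string values chosen for the magic numbers, the identifiers, and the addresses are all placeholders rather than values from the paper.

```python
# Placeholder magic numbers; the paper does not specify their actual values.
MAGIC_DIRECT, MAGIC_INDIRECT, MAGIC_HOST = "MD", "MI", "MH"


def resolve(table, resource_id):
    """Resolve a resource identifier to a network location, or None if not found."""
    entry = table.get(resource_id)               # step 1
    if entry is None:
        return None
    if entry[0] == MAGIC_DIRECT:                 # step 2
        return entry[1]
    if entry[0] == MAGIC_INDIRECT:               # step 3
        _, host_id, path = entry
        host_entry = table.get(host_id)          # step 3(a): second lookup
        if host_entry is None or host_entry[0] != MAGIC_HOST:
            return None
        return host_entry[1] + path              # steps 3(b) and 3(c)
    return None                                  # step 4


table = {
    "r1": (MAGIC_DIRECT, "http://192.0.2.1:80/a"),
    "r2": (MAGIC_INDIRECT, "h1", "/b"),
    "h1": (MAGIC_HOST, "http://192.0.2.9:8080"),
}

before = resolve(table, "r2")
# The host moves: only its host entry is rewritten.
table["h1"] = (MAGIC_HOST, "http://198.51.100.7:8080")
after = resolve(table, "r2")
```

As in the figure, a host entry is not itself treated as a resource here, so resolving a host identifier directly falls through to step 4.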
Figure 4: Types of delays during concurrent get operations. c_g is the minimum
delay between the operations, while c_r is the delay due to network latency.

3.2 When to use direct or indirect entries 

A host can choose whether to use direct or indirect entries for its resources 
depending on which approach performs better for its needs. But under which 
circumstances should the host choose which approach? This section discusses 
this in terms of lookup overhead and update latency. 

Since get and put operations to the distributed hash table can be pipelined,
where multiple operations may be handled concurrently as in figure 4, we will
consider the time cost c_g or c_p for an individual get or put operation
separately from the fixed time cost c_r or c_q due to network latency in the get or
put operation. This latency comes not only from accessing the distributed hash table
externally but also from the communication among the distributed hash table
nodes. We will assume that the number of nodes in the distributed hash table
is constant so that these costs are also essentially constant.

When using direct entries, all entries referencing resources in a given host
must be updated independently. With n resources in a host, the migration time
c_{m,d} required to update all of the entries when it changes its network address is

    c_{m,d} = n c_p + c_q                                        (1)

On the other hand, only a single host entry needs to be updated when using
indirect entries, so the migration time c_{m,i} in this case is

    c_{m,i} = c_p + c_q                                          (2)

A direct entry requires only a single get operation to resolve an identifier,
whereas an indirect entry requires getting the indirect entry first and then
obtaining the appropriate host entry (the two get operations cannot be done
concurrently since the second operation is done based on the result from the first
operation), so the respective lookup times c_{l,d} and c_{l,i} are

    c_{l,d} = c_g + c_r
    c_{l,i} = 2 (c_g + c_r)

If there are r_l lookups per unit time and r_m migrations per unit time, then
the overall time costs C_d and C_i per unit time when using direct and indirect
entries, respectively, are

    C_d = r_l c_{l,d} + r_m c_{m,d} = r_l (c_g + c_r) + r_m (n c_p + c_q)
    C_i = r_l c_{l,i} + r_m c_{m,i} = 2 r_l (c_g + c_r) + r_m (c_p + c_q)

When minimizing the overall time cost, it is better to use indirect entries 
when 

    C_i < C_d
    r_l (c_g + c_r) < r_m (n - 1) c_p
    r_l / r_m < (n - 1) c_p / (c_g + c_r)
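
As a quick numerical check of this derivation, the sketch below evaluates the overall costs for direct entries, r_l (c_g + c_r) + r_m (n c_p + c_q), and for indirect entries, 2 r_l (c_g + c_r) + r_m (c_p + c_q), and compares the outcome with the break-even inequality. The function names and all numeric parameter values are assumptions chosen for illustration; they are not measurements from the paper.

```python
def cost_direct(n, r_l, r_m, c_g, c_r, c_p, c_q):
    """Overall time cost per unit time with direct entries."""
    return r_l * (c_g + c_r) + r_m * (n * c_p + c_q)


def cost_indirect(n, r_l, r_m, c_g, c_r, c_p, c_q):
    """Overall time cost per unit time with indirect entries (n does not appear)."""
    return 2 * r_l * (c_g + c_r) + r_m * (c_p + c_q)


def prefer_indirect(n, r_l, r_m, c_g, c_r, c_p):
    """Break-even rule: r_l / r_m < (n - 1) c_p / (c_g + c_r)."""
    return r_l / r_m < (n - 1) * c_p / (c_g + c_r)


# Illustrative values only (times in seconds, rates per unit time).
costs = dict(c_g=0.05, c_r=0.5, c_p=0.05, c_q=0.5)

for n in (2, 1000):
    by_rule = prefer_indirect(n, r_l=50, r_m=1, c_g=0.05, c_r=0.5, c_p=0.05)
    by_cost = cost_indirect(n, 50, 1, **costs) < cost_direct(n, 50, 1, **costs)
    print(n, by_rule, by_cost)
```

With these example values, a host with only two resources is better served by direct entries, while a host with 1000 resources is better served by indirect entries, and the rule agrees with the direct cost comparison in both cases.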

One may also wish to give more weight to reducing migration times or lookup 
times. If we set the weights w_m and w_l by how much importance we attach to
reducing migration times or lookup times, respectively, we can compute the
weighted time costs as

    C'_d = w_l r_l c_{l,d} + w_m r_m c_{m,d}
    C'_i = w_l r_l c_{l,i} + w_m r_m c_{m,i}

and then indirect entries should be used when

    (w_l r_l) / (w_m r_m) < (n - 1) c_p / (c_g + c_r)            (3)

Assuming a large n, with W = w_l / w_m denoting the relative importance of reducing
lookup times compared to migration times and R = r_l / r_m denoting how often lookups
occur compared to migrations, equation (3) can be approximately rewritten as

    W R < n c_p / (c_g + c_r)                                    (4)

Equation (4) agrees with our intuition that direct entries should be used
when migration times do not matter or when migrations are rare, and that
indirect entries should be used when migration times do matter and migrations
happen often for mobile hosts with a large number of resources. It also gives a
concrete formula for deciding whether to use direct or indirect entries.
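
The weighted decision rule is simple enough to express as a small helper. The sketch below evaluates both the exact condition, W R < (n - 1) c_p / (c_g + c_r), and its large-n approximation, W R < n c_p / (c_g + c_r), with W = w_l / w_m and R = r_l / r_m. The function name and the numeric values are assumptions for illustration only.

```python
def prefer_indirect_weighted(n, w_l, w_m, r_l, r_m, c_g, c_r, c_p):
    """Return (exact, approximate) decisions on whether to use indirect entries."""
    W = w_l / w_m   # relative importance of lookup time versus migration time
    R = r_l / r_m   # how often lookups occur compared to migrations
    exact = W * R < (n - 1) * c_p / (c_g + c_r)
    approximate = W * R < n * c_p / (c_g + c_r)
    return exact, approximate


# Illustrative values: 5000 resources, 100 lookups per migration.
balanced = prefer_indirect_weighted(5000, w_l=1, w_m=1, r_l=100, r_m=1,
                                    c_g=0.05, c_r=0.5, c_p=0.05)
# Same workload, but lookup time is considered ten times more important.
lookup_heavy = prefer_indirect_weighted(5000, w_l=10, w_m=1, r_l=100, r_m=1,
                                        c_g=0.05, c_r=0.5, c_p=0.05)
print(balanced, lookup_heavy)
```

With these example values, the balanced weighting favors indirect entries, while weighting lookups ten times more heavily tips the decision back to direct entries.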



4 Evaluation 

In order to evaluate how direct and indirect entries perform in a real
network, we conducted experiments on OpenDHT [15]. OpenDHT is a public
distributed hash table service which runs on about 200 nodes in PlanetLab [?].
We used the service by selecting a single gateway to the distributed hash
table and accessing it with the XML-RPC [20] interface throughout our
experiments. Here the mobile host is not part of the distributed hash table, similar
to how clients are separate from the distributed hash tables in SFR [19] and
CoDoNS [13] (see footnote 4). Figure 5 illustrates our experimental setup.

4 In cases where mobile hosts are part of the distributed hash table, the overhead for 
updating the routing tables should also be considered. 

Figure 5: Experimental setup. The mobile host, which can move around and 
change its network address, accesses OpenDHT through a gateway which is one 
of the nodes in the distributed hash table. 





            Lookup time (s)

Direct      0.53 ± 0.63
Indirect    1.13 ± 0.60

Table 2: Lookup times and their standard deviations.

We compared lookup times and migration times when using direct entries 
and indirect entries for a single host. The host, a 2.16 GHz Intel Core 2 Duo with
1 GB of memory connected to the Internet via Ethernet, was migrated between
two network addresses. Resource identifiers were mapped to URLs that point
to files. A URL was stored directly in a direct entry, while only the URL path
and a host identifier were stored in an indirect entry, with the IP address of the
host being stored in a host entry. 

We first measured the lookup times for resolving an identifier to a URL. Since 
lookup for a direct entry requires exactly a single get operation and lookup for an 
indirect entry requires exactly two get operations, lookup times do not depend 
on the number of resources in a host (see footnote 5). Thus we measured the average lookup
times required by direct and indirect entries by first inserting entries for 5000 
resource identifiers into the distributed hash table and then resolving randomly 
selected identifiers 2000 times for each case. As expected, the average lookup 
time for indirect entries was roughly twice that of direct entries as can be seen 

5 The time for a get operation should be constant for a distributed hash table with a fixed 
number of nodes, since network latency dominates the time. 
Figure 6: Migration times for up to 100 resources in the mobile host. The error
bars denote the standard deviation for each case. 



in table 2.

Next, we measured the migration times when using direct or indirect entries 
with varying numbers of resources in the host. For each number of resources, 
we first put in the entries for each resource into the distributed hash table. We 
then migrated the host 100 times and measured the average migration time. 

When updating direct entries, 100 entries were updated concurrently, which 
is much faster than updating each entry one by one. Significantly higher
concurrency was problematic because the gateway to OpenDHT had trouble
handling that many connections.
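
One way to cap the number of in-flight updates at 100 is a bounded worker pool. The sketch below is an assumption about how such batching could be done, not a description of the authors' client: `put` is a hypothetical callable standing in for whatever issues a single put request to the gateway, and here it simply writes to a local dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def update_entries(put, entries, max_concurrency=100):
    """Issue one put per entry with at most max_concurrency requests in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map re-raises any exception from put when results are consumed.
        results = list(pool.map(lambda item: put(item[0], item[1]), entries.items()))
    return len(results)


# Hypothetical stand-in for a real put request to a gateway.
store = {}


def fake_put(key, value):
    store[key] = value


entries = {f"res-{i}": f"http://192.0.2.1/{i}" for i in range(250)}
updated = update_entries(fake_put, entries, max_concurrency=100)
print(updated)
```

Keeping the limit as a parameter makes it easy to lower the concurrency if the gateway cannot sustain that many simultaneous connections.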

Also, we selected the entry with the largest remaining time-to-live value 
when retrieving entries from OpenDHT. This entry is the one that is most up- 
to-date since we used a fixed TTL value for all entries. We did not have to 
worry about individual entries becoming large enough to skew the results (see footnote 6), since
we alternated the host between only two network addresses. 
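
Selecting the most recent value by remaining time-to-live can be sketched as follows. The representation of a get result as a list of (value, remaining TTL in seconds) pairs is an assumption made for this example and is not the actual OpenDHT response format; the selection is only valid when every put uses the same TTL, as in the experiments.

```python
def freshest_value(values_with_ttl):
    """Pick the value with the largest remaining TTL, or None if there are no values.

    With a fixed TTL for every put, the value with the most time left
    is the one that was stored most recently.
    """
    if not values_with_ttl:
        return None
    value, _remaining_ttl = max(values_with_ttl, key=lambda pair: pair[1])
    return value


# Hypothetical result of a get: the old address is still unexpired.
values = [
    ("http://192.0.2.1:8080", 1200),     # stored earlier, less time left
    ("http://198.51.100.7:8080", 3500),  # stored after the latest migration
]
current = freshest_value(values)
```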

Our results for the migration times are shown in figures 6, 7, and 8, where
the host contained up to 100, 1000, and 5000 resources, respectively. We can see
that migrating direct entries takes time linear in the number of entries in the
host, as is expected from equation (1). On the other hand, the time required for
updating indirect entries is essentially constant, as is expected from equation (2).

While migration with indirect entries took only about a second, migration
with direct entries took over four and a half minutes with 5000 resources contained
in the host. Even with only 10 resources, it took about 9 seconds longer to
update direct entries compared to the one second it takes to migrate with indirect
entries, which can be a significant difference for interactive applications.

6 OpenDHT returns all unexpired values that have been associated with a key. 
Figure 7: Migration times for up to 1000 resources in the mobile host. The
error bars denote the standard deviation for each case.

Figure 8: Migration times for up to 5000 resources in the mobile host. The
error bars denote the standard deviation for each case.

Mobile hosts could contain even more resources than what was tried in our
experiments. For example, the home directory in a personal machine of one 
of the authors contains more than 60,000 files. A straightforward
extrapolation from our results suggests that this case would require almost an hour for
migration when using direct entries. 
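
The extrapolation is a simple linear scaling, which can be reproduced as below. The sketch assumes that migration time with direct entries grows in direct proportion to the number of resources and uses four and a half minutes as the time for 5000 resources; since the measured time was somewhat over that, the result is a rough lower estimate.

```python
def extrapolate_migration_seconds(measured_seconds, measured_n, target_n):
    """Scale a measured migration time linearly with the number of resources."""
    return measured_seconds * target_n / measured_n


# About 4.5 minutes for 5000 resources, scaled to 60,000 files.
estimate = extrapolate_migration_seconds(4.5 * 60, 5000, 60000)
print(f"{estimate / 60:.0f} minutes")  # prints "54 minutes"
```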

These results show that the drastic reduction in migration time by using 
indirect entries over direct entries can be worth the small increase in lookup 
time required when resolving indirect entries. 

5 Conclusions 

When identifying resources in a distributed system using identifiers based on 
random numbers, the most straightforward way to resolve identifiers with a 
distributed hash table is to store the network location directly in the entry 
keyed by the identifier. However, when a mobile host which contains multiple 
non-replicable resources changes its network address, all of the associated entries 
in the distributed hash table must be updated. 

When the number of resources in the mobile host is large, updating all of 
the entries so that remote hosts can properly use resources in the mobile host 
can take a long time. Therefore we proposed an alternative approach, where 
the entry keyed by the resource identifier contains only a host identifier and
a host-specific path for the resource, and the host identifier itself is a key to a
host entry containing the actual network address for the mobile host. 

With our proposed approach, only the host entry needs to be updated when 
the mobile host changes its network address. This can drastically reduce the 
delay during which its resources cannot be resolved to their current network 
locations, as was shown theoretically and experimentally. However, there is a 
small increase in the time required for resolving identifiers with our approach, 
so we also discussed under which circumstances it should not be used. 

In our work, we only consider whether to use direct or indirect entries given 
static lookup and migration rates. It would be interesting to see how an adaptive 
system could dynamically adjust the approach used in order to achieve optimal 
performance with changing lookup and migration rates, since such a system 
would also have to consider the overhead for switching between the two. 

It may also be possible to apply system-specific optimizations to our
approach. For example, while our approach can be applied to any type of
distributed hash table, it could be possible to reduce the lookup overhead when it
is applied on top of OpenDHT by taking advantage of the ReDiR framework.

We plan to apply our approach in a decentralized and unified naming system 
we are developing, where it would improve performance for identifying resources 
such as files and network services located inside mobile devices in a persistent 
and location-independent manner. 